NSF PAR Search | NSF Public Access Repository

Tag-grounded Visual Instruction Tuning with Retrieval Augmentation

https://doi.org/10.18653/v1/2024.emnlp-main.120

Qi, Daiqing; Zhao, Handong; Wei, Zijun; Li, Sheng (November 2024, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing)

Full Text Available
Sequence-to-Segments Networks for Detecting Segments in Videos

https://doi.org/10.1109/TPAMI.2019.2940225

Wei, Zijun; Wang, Boyu; Hoai, Minh; Zhang, Jianming; Shen, Xiaohui; Lin, Zhe; Mech, Radomir; Samaras, Dimitris (March 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence)
Full Text Available
Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning

Yang, Zhibo; Huang, Lihan; Chen, Yupei; Wei, Zijun; Ahn, S.; Samaras, Dimitris; Hoai, Minh (June 2020, IEEE Conference on Computer Vision and Pattern Recognition (CVPR))

Human gaze behavior prediction is important for behavioral vision and for computer vision applications. Most models mainly focus on predicting free-viewing behavior using saliency maps, but do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. We modeled the viewer’s internal belief states as dynamic contextual belief maps of object locations. These maps were learned and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, making about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context.
Full Text Available
Learning Visual Emotion Representations From Web Data

https://doi.org/10.1109/CVPR42600.2020.01312

Wei, Zijun; Zhang, Jianming; Lin, Zhe; Lee, Joon-Young; Balasubramanian, Niranjan; Hoai, Minh; Samaras, Dimitris (June 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
Full Text Available
Reading Detection in Real-time

Kelton, Conor; Wei, Zijun; Ahn, Seoyoung; Balasubramanian, Aruna; Das, Samir R; Samaras, Dimitris; Zelinsky, Gregory (June 2019, Symposium on Eye Tracking Research And Application)

Observable reading behavior, the act of moving the eyes over lines of text, is highly stereotyped among the users of a language, and this has led to the development of reading detectors: methods that take windows of sequential fixations as input and predict whether the fixation behavior during those windows is reading or skimming. The present study introduces a new method for reading detection using Region Ranking SVM (RRSVM). An SVM-based classifier learns the local oculomotor features that are important for real-time reading detection while it is optimizing for the global reading/skimming classification, making it unnecessary to hand-label local fixation windows for model training. This RRSVM reading detector was trained and evaluated using eye movement data collected in a laboratory context, where participants viewed modified web news articles and had to either read them carefully for comprehension or skim them quickly for the selection of keywords (separate groups). Ground truth labels were known at the global level (the instructed reading or skimming task), and obtained at the local level in a separate rating task. The RRSVM reading detector accurately predicted 82.5% of the global (article-level) reading/skimming behavior, with accuracy in predicting local window labels ranging from 72-95%, depending on how tuned the RRSVM was for local and global weights. With this RRSVM reading detector, a method now exists for near real-time reading detection without the need for hand-labeling of local fixation windows. With real-time reading detection capability comes the potential for applications ranging from education and training to intelligent interfaces that learn what a user is likely to know based on previous detection of their reading behavior.
Full Text Available
Benchmarking Gaze Prediction for Categorical Visual Search

https://doi.org/10.1109/CVPRW.2019.00111

Zelinsky, Greg; Yang, Zhibo; Huang, Lihan; Chen, Yupei; Ahn, S; Wei, Zijun; Adeli, Hossein; Samaras, Dimitris; Hoai, Minh (June 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))

The prediction of human shifts of attention is a widely studied question in both behavioral and computer vision, especially in the context of a free viewing task. However, search behavior, where the fixation scanpaths are highly dependent on the viewer’s goals, has received far less attention, even though visual search constitutes much of a person’s everyday behavior. One reason for this is the absence of real-world image datasets on which search models can be trained. In this paper we present a carefully created dataset for two target categories, microwaves and clocks, curated from the COCO2014 dataset. A total of 2183 images were presented to multiple participants, who were tasked with searching for one of the two categories. This yields a total of 16184 validated fixations used for training, making our microwave-clock dataset currently one of the largest datasets of eye fixations in categorical search. We also present a 40-image testing dataset, where images depict both a microwave and a clock target. Distinct fixation patterns emerged depending on whether participants searched for a microwave (n=30) or a clock (n=30) in the same images, meaning that models need to predict different search scanpaths from the same pixel inputs. We report the results of several state-of-the-art deep network models that were trained and evaluated on these datasets. Collectively, these datasets and our protocol for evaluation provide what we hope will be a useful test-bed for the development of new methods for predicting category-specific visual search behavior.
Full Text Available
